Abstract

This study explores the idea of building a library of configurable VHDL components
for use in digital radar applications. Configurable components allow a designer
to choose which components he or she needs and to configure those components for a
specific application. This shortens design time for ASICs and FPGAs because
the components are already designed and tested. The idea is demonstrated with
a configurable, dynamic, pipelinable fast Fourier transform (FFT). Many FFT
implementations exist, but this implementation is both configurable and dynamic.
Pre-synthesis customization allows the FFT to be tailored to almost any DSP
application, and the dynamic property allows the FFT to calculate FFTs of different
lengths in real time. Three objectives are accomplished: design and characterization
of the aforementioned FFT; analysis of the error in the FFT calculation for different
twiddle factor bit widths; and an analysis of all the configurations of the synthesized
design using a 90 nm technology library. Speeds of up to 225 MHz have been simulated
for a length-1024 FFT using the 90 nm technology.
A Modular Mixed Signal VLSI Design Approach for Digital Radar Applications

I.  Introduction 

Since at least the year 2000 the Pentagon has seen a need for highly advanced
electronic warfare (EW) aircraft. The Pentagon published a Kosovo after-action
report to Congress discussing how NATO forces had difficulty in targeting missile
sites [8]. A separate report said the problems included interference from other
aircraft jammers with friendly targeting devices. These reports prompted Congress
to begin a study of ways to improve EW. Billions of dollars are spent researching and
developing newer and more advanced radar systems. In addition to the high costs,
design and development can take months or even years.

In  most  radar  systems  digital  signal  processing  (DSP)  is  used  extensively.  DSP 
is  the  study  of  signals  in  a  digital  representation  and  the  processing  methods  of  these 
signals. The main goal of DSP is to filter or measure real-time analog signals. An
analog-to-digital  converter  (ADC)  is  used  initially  to  transform  analog  signals  used  in 
radar  communications  into  digital  signals.  Many  types  of  filters  and  transforms  are 
used  in  DSP.  These  functions  are  implemented  in  some  type  of  Application  Specific 
Integrated  Circuit  (ASIC).  A  general  conventional  design  flow  for  ASICs  is  as  follows: 

1.  Functional  Specifications 

2.  Design  Partitioning 

3. Register Transfer Level (RTL) Design & Simulation

4.  Functional  Verification 

5.  Synthesis  for  Area  &  Timing  Optimizations 
6. Placement & Routing

7.  Chip  Fabrication 

This design flow is limited by the length of time needed to make an ASIC.
The RTL design and simulation process (Step 3) alone can take many weeks or months
to complete, depending on the complexity of the design. The design flow is also
limited by the high costs associated with it and the ASIC's limited flexibility. To
address these problems, a faster and more adaptable design flow is proposed, built on
a library of pre-defined modular components. This library will consist of highly
customizable and configurable code for DSP functions that can target either ASICs
or field programmable gate arrays (FPGAs) to produce circuits that suit the intended
applications. The development of this library will be a time-consuming process in
itself, but once the library is complete a designer need only pick and choose
which components from the library he or she wants to use. Because the components
are configurable, a wide range of designs can be built from them. Performing this work
in-house will save the Department of Defense (DoD) from having to out-source to
companies such as Boeing or Raytheon, who could charge millions of dollars to produce
such a product.

1.1  Specific  Issue: 

DSP is an extremely important function in radar applications. The processing
of digital data must be performed as fast as possible so the warfighter has the
advantage in any combat situation. One key DSP component is the Fast Fourier
Transform (FFT). The FFT is an algorithm for converting a digital signal in the
time domain to a signal in the frequency domain. One of the original uses of the
FFT was to distinguish between nuclear explosions and natural seismic events. These
two phenomena produce different frequency spectra. By converting the signals to the
frequency domain a distinction between the two events could be seen. Aircraft have
different radar signatures, so by using the FFT on the received radar signals the pilot
can see the aircraft's location and speed. In this research a configurable FFT will be
developed for the aforementioned library. In addition, this FFT will be dynamic; i.e.,
it will be able to calculate FFTs of different lengths in real time. This implementation
will use a 90 nm technology library from the Taiwan Semiconductor Manufacturing
Company (TSMC). Results from this library will be compared to those of the AMI
350 nm library from Oklahoma State University. The 90 nm technology is expected to
provide faster speeds and lower power consumption than the 350 nm library.

1.2  Problem  Statement: 

The  problem  to  be  solved  is  the  demonstration  of  a  modular  digital  radar  library 
by  designing  and  characterizing  one  possible  component.  The  FFT  being  designed 
will  be  both  configurable  and  dynamic.  The  configurable  parameters  can  be  changed 
pre-synthesis  and  the  dynamic  parameters  can  be  changed  at  run-time.  To  keep  the 
chip size small and power consumption low, a minimal hardware approach will be
used.  This  will  result  in  a  longer  design  time  for  each  component  in  the  library  but 
will  allow  for  the  most  efficient  design. 

1.3  Scope  and  Assumptions: 

It is assumed that readers of this paper have a basic understanding of digital
signal processing and, more specifically, FFTs. Additionally, strong knowledge
of the Very-High-Speed Integrated Circuit (VHSIC) Hardware Description Language
(VHDL) is required to understand the coding of the design. The software used for
this research includes ModelSim for circuit simulations, MATLAB for simulations and
error analysis, and the Cadence Encounter RTL Compiler for synthesis and power,
timing, and area analysis. Knowledge of simple digital logic components is also
assumed. Such components include muxes, adders, and pipeline registers.
1.4 Thesis Organization:

The next chapter discusses the background information necessary to understand
the scope of this project. A discussion of the mathematics and algorithms for the
FFT is included. Additionally, several current (within the past 5 years) FFT
implementations are analyzed and their results discussed. Chapter III covers the
theory involved in the design of the FFT architecture and the methods of testing
used. The results of the implementation and characterization are discussed in
Chapter IV. Finally, a review and a look at future topics are presented in
Chapter V. All VHDL code can be viewed in the appendices.

1.5  Chapter  Summary: 

The  purpose  of  this  research  project  is  to  characterize  and  implement  an  FFT 
component  for  use  at  the  Air  Force  Research  Lab  (AFRL).  Pending  a  successful 
demonstration  of  this  component,  the  component  will  be  included  in  a  future  library 
of  many  configurable  DSP  functions,  saving  the  U.S.  military  millions  of  dollars  in 
addition  to  many  months  of  design  time. 
II. Background

This  chapter  provides  an  overview  of  the  research  involved  in  understanding  the 
scope  of  this  thesis.  Fourier  transforms  and  specifically  FFTs  are  reviewed. 
Two  FFT  algorithms  are  discussed,  as  one  will  be  used  in  the  FFT  implementation. 
Many  FFT  implementations  are  available  in  the  IEEE  database.  Several  of  the  most 
recent  implementations  and  their  claimed  results  are  analyzed. 

2.1 Fourier Transform

To  understand  the  derivation  and  need  for  the  Fast  Fourier  Transform,  we  will 
first  look  at  the  Fourier  Transform.  The  Fourier  transform  and  series  are  named 
after  the  French  scientist  and  mathematician  Joseph  Fourier.  The  equation  for  the 
Fourier Transform is given in (2.1). It is a generalization of the complex Fourier
Series [4]. This equation takes a signal in the time domain and transforms it into the
frequency domain. Information such as frequency range and energy can be obtained
from the frequency domain representation. Figure 2.1 shows an example of such a
transformation. 


$$
X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt \tag{2.1}
$$

Figure 2.1: Signal representation in (a) time domain (b) frequency domain
2.2 Discrete Fourier Transforms

The Fourier Transform works on continuous signals. In fields such as signal
processing,  signals  are  usually  sampled.  These  sampled  signals  are  called  discrete 
signals.  To  calculate  the  Fourier  Transform  of  a  discrete  signal,  the  Discrete  Fourier 
Transform  (DFT)  is  used.  James  Tsui  explains  the  two  limitations  of  the  continuous 
Fourier  transform  in  his  book  [21]: 

First,  the  function  in  the  time  domain  must  be  representable  in  closed 
form  so  that  the  Fourier  integral  can  be  performed.  Thus,  unless  the 
input  function  can  be  written  in  closed  form,  it  is  impossible  to  evaluate 
the  integral.  Second,  even  if  the  time  function  can  be  written  in  closed 
form,  it  might  also  be  difficult  to  find  a  closed-form  solution  to  the  integral. 

The data to be transformed comes from an ADC, so it is digitized and the function
in the time domain is unknown. Unlike the Fourier transform, the DFT can be
performed on any kind of input data; therefore, its usage is unlimited [21]. Also, the
results from a DFT are an approximate solution.

The general definition of the DFT is as follows: let $x(n)$, $n = 0, 1, 2, \ldots, N-1$,
be an N-point sequence. From [18], the definition of its discrete Fourier transform is

$$
X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j \frac{2\pi}{N} nk}, \quad k = 0, 1, 2, \ldots, N-1 \tag{2.2}
$$

For convenience, denote $e^{-j \frac{2\pi}{N}}$ by $W_N$, so equation (2.2) becomes:

$$
X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{kn}, \quad k = 0, 1, 2, \ldots, N-1 \tag{2.3}
$$

which can be expanded into

$$
X(k) = x(0) W_N^{0} + x(1) W_N^{k} + x(2) W_N^{2k} + \ldots + x(N-1) W_N^{(N-1)k} \tag{2.4}
$$

The $W_N$ term is an $N$th root of unity, also known as a “twiddle factor”. This
term was coined by Gentleman and Sande in 1966, and has since become widespread
in the world of FFTs [9]. From equation (2.4), the calculation of each $X(k)$ requires
$N$ complex multiplications and $N$ complex additions. Since $X(k)$ is calculated for
$k$ from 0 to $N-1$, the direct computation of the DFT requires on the order of $N^2$
multiplications and $N^2$ additions. The complexity of this equation is $O(N^2)$. For
example, a 1024-point DFT would require approximately 2,097,152 operations
($2 \times 1024^2$). Fortunately, there is an algorithm which reduces the complexity
from $O(N^2)$ to $O(N \log_2 N)$. For the same 1024-point transform, only about 20,480
operations ($2 \times 1024 \times \log_2 1024$) are needed, roughly a hundredfold
reduction, which means a faster calculation. This algorithm is called the Fast
Fourier Transform.
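
The operation counts above can be checked with a short Python sketch. This is only an illustration, not the implementation used in this research: the function names are hypothetical, and the FFT count assumes the same convention as the text ($N \log_2 N$ multiplications plus $N \log_2 N$ additions).

```python
import cmath
import math


def dft_direct(x):
    """Evaluate equation (2.3) directly: X(k) = sum over n of x(n) * W_N^(k*n)."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)  # W_N, the principal Nth root of unity
    return [sum(x[n] * W ** (k * n) for n in range(N)) for k in range(N)]


def direct_dft_ops(N):
    """N complex multiplications and N complex additions for each of the N outputs."""
    return 2 * N * N


def radix2_fft_ops(N):
    """Rough FFT count: N*log2(N) multiplications plus N*log2(N) additions."""
    return 2 * N * int(math.log2(N))


print(direct_dft_ops(1024))   # operation count for the direct 1024-point DFT
print(radix2_fft_ops(1024))   # operation count for the 1024-point FFT
```

For N = 1024 the two functions reproduce the figures quoted in the text, 2,097,152 and 20,480.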

2.3  Fast  Fourier  Transforms 

In 1805, Carl Friedrich Gauss described the critical factorization steps for the
FFT. About 160 years later, in 1965, James Cooley and John Tukey formally published
the algorithm for the FFT [7]. They exploited the symmetry properties of the complex
exponential, reducing the complexity to $N \log_2 N$. There are two variations of the
FFT algorithm: the Decimation-In-Time (DIT) FFT algorithm and the
Decimation-In-Frequency (DIF) FFT algorithm.

2.3.1 Decimation-In-Time FFT. Cooley and Tukey, using the Danielson-Lanczos
Lemma from 1942 [16], developed what is known as the decimation-in-time
FFT algorithm. This lemma is only applicable if the length is a power of 2. From
now on, we will assume N, the length of the transform, is a power of 2. The allowable
lengths in the VHDL implementation are 4, 8, 16, ..., up to 1024. For the
DIT algorithm, x(n) is divided into two sequences, each of length N/2. The
even-indexed samples and odd-indexed samples are grouped separately. Equation (2.2)
can be rewritten as


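$$
\begin{aligned}
X(k) &= \sum_{n=0}^{N-1} x(n)\, e^{-j \frac{2\pi nk}{N}} \\
     &= \sum_{n=0}^{\frac{N}{2}-1} x(2n)\, e^{-j \frac{2\pi (2n)k}{N}}
      + \sum_{n=0}^{\frac{N}{2}-1} x(2n+1)\, e^{-j \frac{2\pi (2n+1)k}{N}} \\
     &= \sum_{n=0}^{\frac{N}{2}-1} x(2n)\, e^{-j \frac{2\pi nk}{N/2}}
      + e^{-j \frac{2\pi k}{N}} \sum_{n=0}^{\frac{N}{2}-1} x(2n+1)\, e^{-j \frac{2\pi nk}{N/2}} \\
     &= \mathrm{DFT}_{\frac{N}{2}}\left[[x(0), x(2), \ldots, x(N-2)]\right]
      + W_N^{k}\, \mathrm{DFT}_{\frac{N}{2}}\left[[x(1), x(3), \ldots, x(N-1)]\right]
\end{aligned} \tag{2.5}
$$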

The simplifications in equation (2.5) show that all frequency outputs X(k) can
be computed as the sum of the outputs of two length-N/2 DFTs, using the even-indexed
and odd-indexed discrete samples respectively. The odd-indexed short DFT
is multiplied by a “twiddle factor” term, $W_N^k$. Because the samples are split into two
separate groups, this algorithm is called a “radix-2” algorithm. Other such algorithms
exist for radix-4 and radix-8, but will not be discussed in this paper. Since the time
samples are rearranged in alternating groups, this algorithm is called decimation in
time. Figure 2.2 shows how this process begins by breaking the inputs up into two
N/2 DFTs. The recombine stage shown in the figure is used to combine the samples
in the correct order. This process is covered later. Now, the two N/2 stages can
be broken down into four N/4-point DFTs, as shown in Figure 2.3. This process is
repeated until a series of two-point DFTs is reached. Figure 2.4 shows the flow
graph for a two-point FFT. This structure is also known as a butterfly.
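
The even/odd split of equation (2.5) maps naturally onto a recursive routine. The following Python sketch is a software illustration under stated assumptions, not the hardware design of this research: the name `fft_dit` is hypothetical, the length is assumed to be a power of 2, and the recursion stops at length 1 rather than at the two-point butterfly.

```python
import cmath


def fft_dit(x):
    """Radix-2 decimation-in-time FFT of a power-of-2 length sequence."""
    N = len(x)
    if N == 1:
        return [complex(x[0])]
    even = fft_dit(x[0::2])  # DFT of the even-indexed samples, length N/2
    odd = fft_dit(x[1::2])   # DFT of the odd-indexed samples, length N/2
    X = [0j] * N
    for k in range(N // 2):
        # Twiddle factor W_N^k applied to the odd-indexed short DFT
        t = cmath.exp(-2j * cmath.pi * k / N) * odd[k]
        X[k] = even[k] + t
        # The short DFTs repeat with period N/2 and W_N^(k + N/2) = -W_N^k
        X[k + N // 2] = even[k] - t
    return X
```

Each level of the recursion performs N/2 butterflies and there are $\log_2 N$ levels, which is where the $N \log_2 N$ growth comes from.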

Figure 2.5 shows an example for a length of 8. Notice the “out-of-order” ordering
of the inputs. This is bit-reversed ordering, a natural consequence of the
mathematics of the FFT. To obtain a bit-reversed number, take the binary
equivalent, reverse the order of the bits, and recalculate the decimal equivalent.
Table 2.1 shows how the numbers are bit-reversed.
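
The bit-reversal procedure just described can be sketched in a few lines of Python. The function name `bit_reverse` and its `num_bits` argument are hypothetical names chosen for illustration.

```python
def bit_reverse(value, num_bits):
    """Reverse the lowest num_bits bits of value."""
    result = 0
    for _ in range(num_bits):
        result = (result << 1) | (value & 1)  # move the lowest bit of value into result
        value >>= 1
    return result


# Input ordering for a length-8 transform (3 address bits)
order = [bit_reverse(i, 3) for i in range(8)]
print(order)  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Applying the function twice returns the original index, so the same routine can be used to restore natural order.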

Figure 2.2: Decimation-in-time of a length N DFT into two length N/2 DFTs
followed by a recombining stage. [12]

Figure 2.3: Decimation-in-time of a length N DFT into four length N/4 DFTs
followed by a recombining stage. [15]

Figure 2.4: Flow graph for computation of a two-point DFT. [18]

Figure 2.5: Decimation-in-time of a length 8 DFT. [18]

This process also allows for in-place computation, which means the results of
the calculations at any stage can be stored in the same memory locations as those
of the input to that stage. This idea is illustrated in Figure 2.5. The calculations of
X(0) and X(4) require the same two inputs. Once this calculation is complete the
two inputs are no longer needed, so the calculated butterfly values of X(0) and X(4)
can be stored in the memory locations of X(0) and X(4). Because of this, only 2N
storage locations are needed.
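
In-place computation can be illustrated with an iterative decimation-in-time routine that overwrites its input buffer. This is a minimal software sketch, assuming a power-of-2 length and a Python list of complex values; the names `fft_in_place` and `_reverse_bits` are hypothetical and the code does not represent the VHDL design.

```python
import cmath


def _reverse_bits(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_in_place(data):
    """Radix-2 DIT FFT that stores every butterfly result over its own inputs."""
    N = len(data)
    bits = N.bit_length() - 1
    # Reorder the inputs into bit-reversed order (swap each pair once)
    for i in range(N):
        j = _reverse_bits(i, bits)
        if i < j:
            data[i], data[j] = data[j], data[i]
    span = 1  # distance between the two inputs of a butterfly
    while span < N:
        for start in range(0, N, 2 * span):
            for k in range(span):
                w = cmath.exp(-2j * cmath.pi * k / (2 * span))
                a = data[start + k]
                b = data[start + k + span] * w
                # The two outputs replace the two inputs they were computed from
                data[start + k] = a + b
                data[start + k + span] = a - b
        span *= 2
    return data
```

No second buffer is allocated: each butterfly reads two locations and writes its two results back to the same locations.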


Table 2.1: Bit-reversed order for N=8.

Decimal    Binary            Bit-Reversed      Decimal
Number     Representation    Representation    Equivalent
0          000               000               0
1          001               100               4
2          010               010               2
3          011               110               6
4          100               001               1
5          101               101               5
6          110               011               3
7          111               111               7


2.3.2 Decimation-In-Frequency FFT. Sande and Tukey developed the
decimation-in-frequency algorithm [19]. The DIF algorithm works backward from the
DIT algorithm. Instead of dividing the input sequence x(n) into smaller
subsequences, the output sequence X(k) is subdivided. The algorithm arranges the
DFT into two parts: calculation of the even-numbered frequency indices X(k) for
k = 0, 2, 4, ..., N - 2 and calculation of the odd-numbered frequency indices
k = 1, 3, 5, ..., N - 1, or X(2r) and X(2r + 1), respectively. We have


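$$
\begin{aligned}
X(2r) &= \sum_{n=0}^{N-1} x(n)\, W_N^{2rn} \\
      &= \sum_{n=0}^{\frac{N}{2}-1} x(n)\, W_N^{2rn}
       + \sum_{n=0}^{\frac{N}{2}-1} x\left(n + \tfrac{N}{2}\right) W_N^{2r\left(n + \frac{N}{2}\right)} \\
      &= \sum_{n=0}^{\frac{N}{2}-1} x(n)\, W_N^{2rn}
       + \sum_{n=0}^{\frac{N}{2}-1} x\left(n + \tfrac{N}{2}\right) W_N^{2rn} \cdot 1 \\
      &= \sum_{n=0}^{\frac{N}{2}-1} \left(x(n) + x\left(n + \tfrac{N}{2}\right)\right) W_{\frac{N}{2}}^{rn} \\
      &= \mathrm{DFT}_{\frac{N}{2}}\left(x(n) + x\left(n + \tfrac{N}{2}\right)\right)
\end{aligned} \tag{2.6}
$$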


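and

$$
\begin{aligned}
X(2r+1) &= \sum_{n=0}^{N-1} x(n)\, W_N^{(2r+1)n} \\
        &= \sum_{n=0}^{\frac{N}{2}-1} \left(x(n) + W_N^{\frac{N}{2}}\, x\left(n + \tfrac{N}{2}\right)\right) W_N^{(2r+1)n} \\
        &= \sum_{n=0}^{\frac{N}{2}-1} \left(\left(x(n) - x\left(n + \tfrac{N}{2}\right)\right) W_N^{n}\right) W_{\frac{N}{2}}^{rn} \\
        &= \mathrm{DFT}_{\frac{N}{2}}\left(\left(x(n) - x\left(n + \tfrac{N}{2}\right)\right) W_N^{n}\right)
\end{aligned} \tag{2.7}
$$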

Notice that only the odd-indexed frequencies are multiplied by the twiddle factors.
Also, the frequency samples are computed separately in alternating groups, hence the
decimation in frequency designation. The inputs of the DIF FFT are in order and
the outputs are in bit-reversed order, the opposite of the DIT algorithm. It is for
this reason the DIF algorithm is chosen for the VHDL implementation. Either way,
re-ordering hardware is necessary to arrange the data before or after the FFT
calculation. Figure 2.6 shows the first stage with the FFT being split into two N/2 DFTs.
These N/2 DFTs are broken down until a length-two DFT is reached. This is shown
in Figure 2.7.
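
Equations (2.6) and (2.7) can be turned into an iterative routine that takes its input in natural order and leaves the output in bit-reversed order. The Python sketch below is a software illustration only; the name `fft_dif_bit_reversed` is hypothetical and a power-of-2 length is assumed.

```python
import cmath


def fft_dif_bit_reversed(x):
    """Radix-2 DIF FFT: natural-order input, bit-reversed-order output."""
    data = [complex(v) for v in x]
    N = len(data)
    span = N // 2  # distance between the two inputs of a butterfly
    while span >= 1:
        for start in range(0, N, 2 * span):
            for n in range(span):
                a = data[start + n]
                b = data[start + n + span]
                # Sum branch feeds the even-indexed outputs, equation (2.6)
                data[start + n] = a + b
                # Difference branch is multiplied by the twiddle factor and
                # feeds the odd-indexed outputs, equation (2.7)
                w = cmath.exp(-2j * cmath.pi * n / (2 * span))
                data[start + n + span] = (a - b) * w
        span //= 2
    return data
```

Element i of the result holds X(k) with k equal to the bit-reversed value of i, so for a length of 8 the outputs appear in the order X(0), X(4), X(2), X(6), X(1), X(5), X(3), X(7).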


Figure 2.6: DIF of a length N DFT into two length N/2 DFTs. [18]

Figure 2.7: DIF of a length 8 DFT. [18]


2.4 VHDL

With the background and mathematics for an FFT in place, a vehicle to create
the FFT circuit will now be discussed. There are several high-level languages which
can be used to describe a digital circuit. VHDL is a popular design entry language for
FPGAs and ASICs. Another popular language is Verilog. One advantage often cited
for VHDL over Verilog is its generate statements, although Verilog-2001 added a
comparable generate construct. Generate statements conditionally or iteratively
instantiate concurrent VHDL statements. In a modular design, generate statements
will be used heavily so that only the hardware a given configuration needs is
created, using the fewest transistors and thus reducing power, delay, and cell area.
Once the VHDL code has been tested for errors and the simulations are correct, the
next step is synthesis. Synthesis takes the high-level description and produces a
gate netlist. The gate netlist is generated by the Cadence Encounter RTL Compiler
software and uses cells from the TSMC 90 nm library.


2.5  Other  FFT  Implementations 

By looking at other FFT implementations one can get an understanding of
what technologies were used, what the targeted results are, and any other novel
ideas in an FFT design. Performing a search in the IEEE Xplore online database,
the Opencores website (www.opencores.org), or Google on FFTs will show many
different implementations of FFTs in VHDL and/or Verilog. In [23] a reconfigurable
FFT which can compute lengths from 4 to 1024 is discussed. The author uses a
radix-2 FFT algorithm. An overall view of the architecture is shown in Figure 2.8.


Figure 2.8: Overall architecture of reconfigurable FFT processor [23]


The butterfly block (BB) carries out the butterfly calculations. Twiddle factors
are stored in memory, called the coefficient memory cluster (CMC). The module stores
512 coefficients, enough to satisfy the requirements of a length 1024 FFT. The 512
twiddle factors are divided into 64 smaller modules called coefficient memory modules
(CMM), with each module storing 8 values. Different coefficient sets are obtained by
combining various CMMs. One set of CMMs will provide twiddle factors for a length
16 FFT. For larger lengths, CMMs are combined together to form a larger memory.
The DMC, or data memory cluster, is composed of two 512x32-bit memories, giving
a total of 1024 memory locations. The address generation block (AGB) generates
addresses for both the DMC and CMC. These reconfigurable modules can compute
addresses for different FFT lengths. The data switch (DS) routes the butterfly
calculations to the correct DMC modules, using the addresses which are determined by
the address switch (AS). The CB, or control block, contains counters which generate
addresses and timing for the entire design. The author used the Verilog language
to design the processor, and the design was synthesized to the UMC 0.18 µm CMOS
standard cell library with the Synopsys Design Compiler [23]. Table 2.2 shows the
power and area results after synthesis. The area is constant because this processor
is able to compute FFTs of length 16 through 1024. With each increase in FFT
length, the consumed power increases. This is expected because more calculations
are performed with larger length FFTs.


Table 2.2: Power and Area Results [23]

FFT Size                  16     32     64     128     256     512     1024
Power Consumption (mW)    4.7    7.9    8.3    13.0    26.1    49.7    81.6
Area (mm²)                2.9 (all sizes)


In articles [10], [11], and [20] a pipelinable FFT architecture is presented. This
type of architecture will ultimately be used in the design of the configurable/dynamic
FFT proposed in this research. As such, the overall design will not be discussed until
a later chapter. Pipelining the FFT processor allows for faster speeds to be achieved.
In [11] the author designs a length 1024 FFT in VHDL, and synthesizes with a 0.5 µm
Complementary Metal-Oxide Semiconductor (CMOS) technology. Speeds of about
20 MHz have been achieved. This design is not configurable and can only calculate
a length 1024 FFT. Another design following the pipeline architecture is mentioned
in [20]. Again, this implementation is limited to a length of 1024. This author
uses Handel-C to implement the FFT processor. Handel-C is a direct C-to-hardware
language, and can be synthesized directly to high density FPGA devices from Altera
or Xilinx [1]. It is based on the C programming language. The author reports a speed
of 82 MHz for a 1024-point FFT.

2.6 Chapter Summary

The  background  of  the  FFT  was  discussed,  along  with  several  algorithms  used 
to  compute  FFTs.  Several  current  FFT  implementations  and  their  results  were  also 
briefly  examined.  These  results  will  be  compared  to  the  results  of  this  research. 
III. Methodology

The methodology for the dynamic configurable FFT will now be discussed. The
goal of this research is to show the feasibility of creating a library with many
configurable DSP modules. The design to be implemented and demonstrated is a
configurable dynamic FFT. The design for the FFT architecture is based on [10], [11],
and [20]. Information on twiddle factors is found in [3]. The overall design is discussed
first, followed by a detailed analysis of the major components found in the architecture.
Minor components such as muxes and shift registers are assumed to be known, so their
design will be omitted. One of the main reasons for creating configurable components
is to be able to take a generic component and conform it to a specific application.
These configurable parameters are processed before synthesizing. The configurability
of the FFT is shown in Table 3.1.

Table 3.1: Configurable parameters.

Parameter       Description
input_width     bit width of real and imaginary parts of input data
output_width    bit width of real and imaginary parts of output data
tf_width        bit width of real and imaginary parts of twiddle factors
log2N           maximum length of FFT (log2 of length); integer between 2 and 10
r1-r10          radix position of fixed rounders
c1-c10          clipping required for each fixed rounder
ri              input register; none or plr (pipeline register)
ro              output register; none or plr (pipeline register)
pi              pipeline FFT; yes or no
rt              register reset type; none, synch, or asynch
nc              number of different lengths; integer between 1 and 8


3.1  Overall  Design 

The inputs and outputs for the FFT are shown in Table 3.2. The FFT implementation
is based on the Radix-2² Single-path Delay Feedback (R2²SDF) architecture,
and uses the DIF algorithm. Input data is processed in order and the output
is produced in bit-reversed order. The FFT receives $N/2^{size}$ complex inputs
sequentially and the first output sample appears after $N/2^{size} - 1$ samples. The size
signal controls the dynamic property of the FFT. Changing this input changes the current
size of the FFT. An overview of the architecture is shown in Figure 3.1. The BF2I
and BF2II are the butterfly modules. The BF2I is the typical module which was
described earlier, and the BF2II is essentially the same except that it also handles the
trivial $-j$ twiddle factor multiplications internally. The boxes above each butterfly
module are shift registers, with the number inside the box giving the number of
stages. The W1(n) through W4(n) variables are the twiddle factors. These are
multiplied with the output of the BF2II module and passed to the next BF2I module.
The twiddle factors are approximated and stored in a ROM. Each butterfly module
has control signals which determine the calculation the module performs. These
signals are generated from a timing controller. In addition to the control signals, the
timing controller generates the addresses for the twiddle factor ROMs. The length
of the FFT in Figure 3.1 is 256. If one wanted to compute a length 128 FFT, the
same architecture would be used. The differences would be that each shift register
is halved (e.g., the 128-stage shift register becomes a 64-stage shift register) and the
final BF2II module is bypassed.

Table 3.2: Inputs and Outputs.

Parameter    Description
d_in_r       real part of complex data input
d_in_i       imaginary part of complex data input
q_out_r      real part of complex data output
q_out_i      imaginary part of complex data output
frame_in     framing control signal; forces FFT calculation to begin on next input sample
frame_out    framing control signal; next output is result of new FFT calculation
clock        rising edge sensitive clock control signal
reset_n      active low control signal for reset
size         dynamic control signal selects length of current FFT; length = $N/2^{size}$
Figure 3.1: R2²SDF FFT Architecture for N=1024 [20]


3.2  Butterfly  Modules 

Two butterfly modules are used, a BF2I and a BF2II. On the first N/2 cycles,
the 2-to-1 multiplexors in the first butterfly module (BF2I) are set to '0' and the
module is idle. The input data is shifted into the shift registers until they are filled.
On the next N/2 cycles, the butterfly module computes a 2-point DFT on each
incoming sample and the corresponding sample stored in the shift registers. The
following equations describe the operations:


$$
\begin{aligned}
Z(n) &= x(n), && 0 \le n < N/2 \\
Z(n + N/2) &= x(n + N/2), && 0 \le n < N/2 \\
Z(n) &= x(n) + x(n + N/2), && N/2 \le n < N \\
Z(n + N/2) &= x(n) - x(n + N/2), && N/2 \le n < N
\end{aligned} \tag{3.1}
$$


A physical implementation of the BF2I module is shown in Figure 3.2.

Figure 3.2: BF2I Module [3]
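
The two phases of the BF2I module can be mimicked with a cycle-level Python model. This is a sketch under stated assumptions rather than a description of the VHDL: the name `bf2i_sdf_stage` is hypothetical, the shift register is modelled as a FIFO that starts out filled with zeros, and the difference result is assumed to be fed back into the shift register, as is usual for single-path delay feedback designs.

```python
from collections import deque


def bf2i_sdf_stage(samples, delay):
    """Cycle-level model of one BF2I stage with a `delay`-deep feedback shift register."""
    fifo = deque([0] * delay)  # shift register contents, oldest value first
    out = []
    for cycle, x in enumerate(samples):
        stored = fifo.popleft()
        if (cycle // delay) % 2 == 0:
            # Multiplexors at '0': shift the input in, pass the stored value out
            fifo.append(x)
            out.append(stored)
        else:
            # Multiplexors at '1': butterfly between stored x(n) and incoming x(n + N/2)
            out.append(stored + x)   # sum goes on to the next stage
            fifo.append(stored - x)  # difference is fed back into the shift register
    return out


# One 8-sample frame followed by 4 zero samples to flush the shift register
result = bf2i_sdf_stage([1, 2, 3, 4, 5, 6, 7, 8, 0, 0, 0, 0], 4)
print(result)  # [0, 0, 0, 0, 6, 8, 10, 12, -4, -4, -4, -4]
```

In this model the sums x(n) + x(n + N/2) leave the stage during the second N/2 cycles, and the differences x(n) - x(n + N/2) follow while the next frame is being shifted in.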
The BF2II module is similar in operation, except it takes into account the trivial
twiddle factor multiplication of $-j$. The following equations describe the operations:


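$$
\begin{aligned}
Z(n) &= x(n), && 0 \le n < N/4 \\
Z(n + N/2) &= x(n + N/2), && 0 \le n < N/4 \\
Z_{real}(n) &= x_{real}(n) + x_{imag}(n + N/2), && N/4 \le n < N/2 \\
Z_{imag}(n) &= x_{imag}(n) - x_{real}(n + N/2), && N/4 \le n < N/2 \\
Z_{real}(n + N/2) &= x_{real}(n) - x_{imag}(n + N/2), && N/4 \le n < N/2 \\
Z_{imag}(n + N/2) &= x_{imag}(n) + x_{real}(n + N/2), && N/4 \le n < N/2 \\
Z(n) &= x(n), && N/2 \le n < 3N/4 \\
Z(n + N/2) &= x(n + N/2), && N/2 \le n < 3N/4 \\
Z(n) &= x(n) + x(n + N/2), && 3N/4 \le n < N \\
Z(n + N/2) &= x(n) - x(n + N/2), && 3N/4 \le n < N
\end{aligned} \tag{3.2}
$$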


A physical implementation of the BF2II module is shown in Figure 3.3.

Figure 3.3: BF2II Module [3] (outputs: Zr(n+N/2), Zi(n+N/2), Zr(n), Zi(n))
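The reason the -j multiplication is called trivial is that it needs no multiplier: it only swaps the real and imaginary parts and negates one of them. The sketch below illustrates this under the assumption that samples are held as (real, imaginary) pairs; the helper names `mul_neg_j` and `bf2ii_butterfly` are hypothetical and do not come from the VHDL source.

```python
def mul_neg_j(re, im):
    """Multiply (re + j*im) by -j using only a swap and a negation."""
    return im, -re


def bf2ii_butterfly(stored, incoming, apply_neg_j):
    """2-point butterfly on (re, im) pairs, optionally rotating the incoming sample by -j."""
    a_re, a_im = stored
    b_re, b_im = mul_neg_j(*incoming) if apply_neg_j else incoming
    upper = (a_re + b_re, a_im + b_im)  # Z(n)
    lower = (a_re - b_re, a_im - b_im)  # Z(n + N/2)
    return upper, lower


# (1 + 2j) combined with (3 + 4j) * (-j) = (4 - 3j)
print(bf2ii_butterfly((1, 2), (3, 4), apply_neg_j=True))
```

With `apply_neg_j=True` the two outputs reduce to the real/imaginary cross terms of the -j butterfly case; with `apply_neg_j=False` the function is the ordinary sum/difference butterfly.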
3.3 Twiddle Factors


The generation of the twiddle factors, or complex roots of unity, for the FFT
determines the amount of error between the VHDL implementation and the FFT
calculation in MATLAB. These values are precomputed and their binary
representation is stored in a ROM. The bit width selected for the twiddle factors controls
the amount of error. Choosing a large bit width gives greater accuracy at the
cost of larger die area and higher power consumption. On the other hand, a small
bit width reduces area and power consumption but increases the error. In the
VHDL implementation, the twiddle factors are based on the module length M and the
FFT length N. The equation to calculate these values is

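\[
W_p(n) = e^{-j 2\pi q(n)/N}, \qquad 0 \le p \le \log_4(N) - 2 \tag{3.3}
\]

expanding this into a trigonometric expression yields

\[
W_p(n) = \cos\left(\frac{2\pi q(n)}{N}\right) - j\sin\left(\frac{2\pi q(n)}{N}\right), \qquad 0 \le p \le \log_4(N) - 2 \tag{3.4}
\]

where

\[
q(n) =
\begin{cases}
0 \cdot 4^p \cdot n, & 0 \le n < M/4 \\
2 \cdot 4^p \cdot \left(n - \frac{M}{4}\right), & M/4 \le n < M/2 \\
1 \cdot 4^p \cdot \left(n - \frac{M}{2}\right), & M/2 \le n < 3M/4 \\
3 \cdot 4^p \cdot \left(n - \frac{3M}{4}\right), & 3M/4 \le n < M
\end{cases}
\tag{3.5}
\]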
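As an illustration of how the exponent sequence above produces one stage's twiddle factors, the following sketch evaluates W_p(n) in floating point. It rests on several assumptions that are not stated explicitly in the text: that the module length is M = N / 4^p, that each branch of q(n) applies to one successive quarter of the M-sample block, that the exponent is taken relative to N, and that N is a power of 4. The function name `twiddle` is hypothetical.

```python
import cmath
import math


def twiddle(p, n, fft_len):
    """Floating-point W_p(n) for the stage-p multiplier (sketch under stated assumptions)."""
    m = fft_len // 4 ** p          # assumed module length M for this stage
    n %= m                         # position inside the current M-sample block
    quarter = n // (m // 4)        # which quarter of the block: 0, 1, 2 or 3
    k = (0, 2, 1, 3)[quarter]      # multiplier order taken from the q(n) branches
    q = k * 4 ** p * (n - quarter * (m // 4))
    return cmath.exp(-2j * math.pi * q / fft_len)


# First-stage (p = 0) factors for a length-16 FFT
for n in range(16):
    w = twiddle(0, n, 16)
    print(n, round(w.real, 4), round(w.imag, 4))
```

The first quarter of each block always yields a factor of 1, which matches the idea that those samples pass through the multiplier unchanged.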

Table 3.3: Wx Modules.

Wx   Contains twiddle factors for lengths...
0    1024, 512
1    1024, 512, 256, 128
2    1024, 512, 256, 128, 64, 32
3    1024, 512, 256, 128, 64, 32, 16, 8

The calculated twiddle factors range between 1 and -1. They are then scaled to the
maximum and minimum values allowed by the twiddle factor bit width. For example, if
the bit width is 10, the twiddle factor is scaled to a value between 511 ($2^9 - 1$) and
-512 ($-2^9$). The real and imaginary calculations are summarized as:

\[
\mathrm{Real} = \mathrm{round}\left(\cos(\theta) \cdot \left(2^{\mathrm{TwiddleFactorBitWidth}-1} - 1\right)\right) \tag{3.6}
\]

\[
\mathrm{Imag} = \mathrm{round}\left(\sin(\theta) \cdot \left(2^{\mathrm{TwiddleFactorBitWidth}-1} - 1\right)\right) \tag{3.7}
\]
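The scaling in Equations 3.6 and 3.7 can be checked with a few lines of code. This is a sketch only: the function name `quantize_twiddle` is hypothetical, and Python's built-in `round` resolves exact .5 ties to the nearest even integer, which may differ from the rounding used by the ROM generator in that rare case.

```python
import math


def quantize_twiddle(theta, bit_width):
    """Scale cos/sin of theta to signed integers per Equations 3.6 and 3.7."""
    scale = 2 ** (bit_width - 1) - 1      # 511 for a 10-bit twiddle factor
    real = round(math.cos(theta) * scale)
    imag = round(math.sin(theta) * scale)
    return real, imag


print(quantize_twiddle(math.pi / 4, 10))  # 0.70710678 * 511 = 361.33..., rounded to 361
print(quantize_twiddle(0.0, 10))          # full-scale positive real part
```

Because the scale factor is 2^(bit width - 1) - 1, the values produced by these two equations are symmetric about zero for any angle.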

Upon storing the twiddle factors in the ROM, they must be offset by 3M/4 samples
to ensure they are aligned with the first sample in the block. If the FFT is configured
for dynamic sizes, all possible twiddle factor ROMs must be made available. For
example, if N=16 and nc=2, FFTs of length 16 and 8 can be calculated. The twiddle
factors for N=16 are different from those for N=8. In this case, the module W3
contains both ROMs and a multiplexer is used to select the correct twiddle factors
based on the currently selected size. This is the case for all dynamic lengths between
1024 and 8. Note that length-4 FFTs do not incorporate twiddle factors. The W3 module
is configurable based on N and nc, so only the minimal logic is created. Table 3.3 shows
the Wx modules and the ROMs they may contain. Figure 3.4 shows the schematic
for a W3 module for a length-1024 FFT with the number of dynamic choices equal
to 8. The output of each ROM is passed to the output using the 8-to-1 mux with the
size signal performing the selection. Smaller dynamic choices use smaller muxes to
conserve chip space.

Figure 3.4: W3 Module for a 1024 point dynamic FFT

The twiddle factors are generated by running a simulation on the ROMGenerate.vhd
file. The configurable parameters are specified in the generic listing. This
code will generate the twiddle factors automatically and store them in a file called
TwiddleFactors.vhd. This file will need to be included in order for the entire design
to be elaborated and synthesized.

3.4  Timing 

Timing is essential for all the components to work correctly. The timing
controller is simply a log2N-bit up counter. The timing signals are passed on to the
butterfly modules and the twiddle factor ROMs. This component also generates the
frame-out signal, which designates the completion of the FFT calculation and indicates
that the first result will appear on the q_out line. The frame-out signal is generated
when the counter "rolls over" to '0'. Due to the dynamic portion of the hardware, the
value at which the counter rolls over depends on the length of the FFT and the value
of the size signal. If a 1024 point FFT is calculated, a 10-bit counter is used. The
rollover value would then be 1111111111. If the FFT is configured to be dynamic, the
rollover value is shifted to the right by the integer value of the size signal. For
example, for a 1024 point FFT, a size signal of 001 would shift the rollover value to the
right by one, producing a value of 0111111111, which corresponds to a length-512 FFT.
Each stage can handle two different FFT lengths, but because the control signals to each
stage are static, the timing controller will shift the count values left one bit based on
the size signal.

If pipelining is enabled, the control signals and twiddle factor addresses must
also be pipelined. This is handled in a module called TimingPLR. Excluding
the first stage, pipeline registers are placed around the multipliers in each stage if
enabled. This placement of pipeline registers was chosen because the critical path,
or longest delay in a circuit, always passes through the multipliers. Placing pipeline
registers around the multipliers shortens the critical path, thus increasing the
frequency at which the FFT can operate. The timing signals for each subsequent stage
must then be delayed by two clock cycles to make sure they meet up with the correct
values in the computation, hence the two pipeline stages in the TimingPLR
component. Because the complex multipliers are sandwiched between pipeline registers, the
twiddle factor address must be delayed by one cycle initially, then by two cycles for
each subsequent stage. The TimingPLR module takes the twiddle factor address signal
as an input, but has two outputs for the address. The first output is delayed by
one cycle to be used in the current stage and the second is delayed by two cycles to
be used in the next stages. Figure 3.5 shows the timing controller and first set of
pipeline stages for a length-1024 FFT.
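The rollover arithmetic for the dynamic timing counter can be written out as a short sketch. The function names `rollover_value` and `next_count` are hypothetical; the model covers only the counter wrap-around described above, not the pipelined control signals.

```python
def rollover_value(log2_n, size):
    """All-ones counter value for the configured length, shifted right by the size signal."""
    full = (1 << log2_n) - 1   # e.g. 1111111111 for a 1024 point FFT
    return full >> size


def next_count(count, log2_n, size):
    """Advance the up counter, wrapping to 0 (frame-out) after the rollover value."""
    return 0 if count == rollover_value(log2_n, size) else count + 1


print(format(rollover_value(10, 0), "010b"))  # static length 1024
print(format(rollover_value(10, 1), "010b"))  # size = 001, length 512
```

A size value of 1 therefore makes the 10-bit counter wrap after 512 counts instead of 1024.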

3.5  Stage  1 

Figure 3.5: Timing Controller and pipeline for a 1024 point dynamic FFT

Figure 3.6(a) shows a diagram of the stage 1 component, and Figure 3.6(b)
shows the corresponding flow graph. The dashed lines correspond to complex data.
For this example an N/4 length such as 4, 16, 64, etc. is assumed. Data enters the
first butterfly module at the x(n + N/2) input. The first two values are passed into
the 2-stage shift register. The third and fourth data points are "butterflied" and the
output is passed on to the fixed rounder. These represent the top two lines in the
second stage shown in Figure 3.6(b). The outputs destined for the bottom two lines are
put back into the shift register and held for two cycles until they can be placed into
the second butterfly module. The -j multiplication is built into the BF2II module.
This module will compute the last two 2-point DFTs and pass the output to the final
fixed rounder and q_out in bit-reversed order. If the length were an N/2 length such
as 8, 32, 128, etc., then only BF2I would be used to compute a 2-point DFT and the
BF2II would be bypassed. The sign extend block extends the bit width by one because
the BF2I and BF2II modules automatically increase the bit width by one during
operation.


3.6  Stage  X 

Figure 3.6: Stage 1 of FFT and Flow Graph Comparison

Figure 3.7 shows a diagram of the rest of the stages. These generic stages are
essentially the same except for the configurable parameters of the fixed rounder, the
shift registers, and the Wx(n) modules. Operation is the same as in the first stage, but
this time the twiddle factors and a complex multiplier are included. If the FFT
is configured for pipelining, then pipeline registers are placed before and after the
complex multiplier.

Figure 3.7: Stage X of FFT


3.7  Completed  Design 


Connecting the timing controller and the different stages together is not as easy
as it sounds. Due to the configurable and dynamic nature of the FFT, all possible
scenarios must be considered while keeping the hardware usage at a minimum. As an
example, a length-1024 dynamic (nc = 8) FFT is shown in Figure 3.8.

Figure 3.8: Layout of a 1024 length dynamic FFT


This configuration has data select logic to determine at which stage the input data
should enter. For example, for a length of 1024 or 512 the input data should enter stage
5. For a length of 256 or 128, the data should bypass stage 5 and begin in stage 4.
In addition, the data select logic selects between using the d_in data or the results
from the previous stage. This logic is essentially a mux (a 2- or 4-input mux with
the unused inputs grounded). The select bit logic for the mux is generated using
Karnaugh maps and the size signal. If a stage is not to be used, then its inputs are
grounded to prevent switching of the elements within the stage, reducing power usage
and heat generation. The sign extender between each stage extends the bit width of
the input data, which is needed because of the bit growth of the butterfly modules and
twiddle factor multiplications. With a static length-1024 FFT, there would be no data
select logic blocks; instead the data would pass right through.

Generate statements are used heavily for this portion of the FFT code. The structural
definition of the code begins by instantiating a timing controller. After that, each case
is broken down by the configured length. In each of these cases, all possible lengths
(based on the nc parameter) are broken down. Again, to prevent unnecessary chip space
usage, generate statements are used. If pipelining is configured, then pipeline registers
are placed to control the arrival time of the butterfly module control signals, twiddle
factor addresses, and the frame-out signal. The logic for the pipeline controls whether the
pipeline is turned on or not. In the 1024/512 case, all pipeline registers should be
functioning. For the 256/128 case, the first set of pipeline registers should be turned
off so the signals arrive correctly. The logic for controlling this functionality is again
determined by Karnaugh maps.
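The mapping from the selected length to the entry stage can be illustrated with a small sketch. Only the 1024/512 and 256/128 cases are stated in the text; the results for shorter lengths are an extrapolation of the same pairing and should be treated as an assumption, as should the function names. The hardware derives the equivalent mux select bits from Karnaugh maps rather than computing them arithmetically.

```python
def entry_stage(fft_length):
    """Stage at which input data enters, assuming each stage serves two adjacent lengths."""
    log2_len = fft_length.bit_length() - 1   # exact for power-of-two lengths
    return (log2_len + 1) // 2


def entry_stage_for_size(max_length, size):
    """Same mapping, driven by the size signal (each increment halves the length)."""
    return entry_stage(max_length >> size)


for length in (1024, 512, 256, 128):
    print(length, entry_stage(length))
```

Under this assumption a size signal of 0 or 1 on a length-1024 design selects stage 5, and a size signal of 2 or 3 selects stage 4.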

3.8  Testing  Procedure 

There are two major areas on which testing will be performed. First, the
Cadence Encounter RTL compiler will be used to analyze the timing, power, and chip
area used by all the configurations of the FFT. For the timing analysis, all the
possible lengths (4 to 1024) and all possible dynamic sizes (1 to 8) will be synthesized
for both the pipelined and non-pipelined versions. Power and cell usage analysis will
be performed with the same parameters, except only the pipelined version will be
examined. This will give an overall view of the configurable properties and
their effects. A PERL script, located in Appendix A, was written to perform these
tests. The script first modifies the configurable parameters in the FFT.vhd file.
Next, the Cadence RTL compiler is invoked via the command line with a synthesis
script passed as an argument to the compiler. This script sets up the 90 nm library,
loads the VHDL files, synthesizes the design, and generates reports for the timing,
cell usage, and power consumption. This script is also located in Appendix A.

The next major area is error analysis. This will be performed using MATLAB
and Modelsim for simulations. The results from the MATLAB FFT function will be
compared to those of the VHDL FFT implementation. Even though there is negligible
error in the MATLAB results due to the IEEE 754 floating point format, the MATLAB
results will be the baseline for these tests. The twiddle factor bit width and its effect
on the data will be analyzed. By varying the twiddle factor bit width, one can change the
amount of error in the VHDL FFT compared to that of the MATLAB FFT function.
The testing procedure for this is as follows: A cosine function is generated and the
data points are sampled and placed into a text file. The FFT VHDL testbench opens
this file and reads the data as inputs. The output is generated and placed into a
different text file. MATLAB then resumes and reads in the VHDL FFT output data.
A comparison of each output point is made between the MATLAB FFT function and
the VHDL FFT function. A stem plot is generated showing the % error between
both functions. In addition, statistics are generated for this set of data. This test is
performed on all FFT lengths with varying twiddle factor bit widths. The MATLAB
m-files which perform this testing are found in Appendix B. Additionally, a frequency
sweep test will be examined. A frequency ranging from 0 Hz to 2.5 GHz in steps
of 10 MHz will be applied to the FFT. For each frequency step, a length-256 FFT will
be calculated both in MATLAB and in the VHDL FFT. The average and maximum percent
error for a range of twiddle factor bit-widths and input bit-widths will be discussed.
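The comparison flow can be illustrated without the VHDL testbench. The sketch below is a stand-in under stated assumptions: it replaces the pipelined VHDL FFT with a direct DFT whose twiddle factors are quantized to a given bit width, and it replaces the MATLAB FFT with the same DFT in floating point. All function names are hypothetical, and the input data is not quantized, so the numbers it produces are not the thesis results.

```python
import cmath
import math


def exact_twiddle(idx, n):
    """Floating-point root of unity W_n^idx (the reference)."""
    return cmath.exp(-2j * math.pi * idx / n)


def make_quantized_twiddle(bit_width):
    """Return a twiddle function whose real/imag parts are rounded to bit_width bits."""
    scale = 2 ** (bit_width - 1) - 1

    def tw(idx, n):
        w = exact_twiddle(idx, n)
        return complex(round(w.real * scale), round(w.imag * scale)) / scale

    return tw


def dft(x, twiddle):
    """Direct DFT using the supplied twiddle function (stand-in for either FFT)."""
    n = len(x)
    return [sum(x[t] * twiddle(k * t % n, n) for t in range(n)) for k in range(n)]


def percent_error(test, reference):
    """Percent error of one output point relative to the reference value."""
    return abs(test - reference) / abs(reference) * 100.0


x = [math.cos(2 * math.pi * t / 8) for t in range(8)]   # one cosine cycle, N = 8
ref = dft(x, exact_twiddle)[1]                          # reference value of bin 1
for bits in (6, 8, 10, 12):
    err = percent_error(dft(x, make_quantized_twiddle(bits))[1], ref)
    print(bits, err)
```

Only the bin containing the cosine is compared here, because the percent error is poorly defined for bins whose reference value is essentially zero.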
3.9 Chapter Summary

The design of a configurable/dynamic FFT processor was discussed in this
chapter. The overall design was introduced and then broken down into many smaller
components. The design analysis was performed on each of these components, along
with the different configurations of each. Each subcomponent was thoroughly tested
in order to reduce the possible errors when assembling the final design. Additionally,
testing procedures were developed to test both the error in the FFT calculations and
the physical attributes such as power and timing.
IV. Analysis and Results


This  chapter  analyzes  the  results  from  the  testing  procedures  discussed  in  the 
previous  chapter.  The  data  results  will  be  examined  first  in  the  error  analysis 
section.  Next,  the  area  and  timing  results  from  synthesis  will  be  evaluated. 


4.1 Error Analysis

4.1.1 Simple Cosine Curve. Due to the digital nature of the FFT algorithm
and the use of approximate values for the twiddle factors, there will be error in the
VHDL results as compared to those of the MATLAB FFT function. Twiddle factors
with bit-widths of 6, 8, 10, and 12 are compared. A summary is
shown in Figure 4.1, and Figures 4.2 to 4.9 show detailed plots of the error between
the VHDL and MATLAB FFT functions for each point of the N-length FFT. The
error is calculated using the equation
$\mathrm{error} = \frac{\mathrm{VHDL} - \mathrm{MATLAB}}{\mathrm{MATLAB}} \times 100\%$. The analysis
is performed on absolute values of the real and imaginary data outputs. The input
bit width used is 10 bits. We begin the analysis with the length-8 FFT as this is
the first length to use twiddle factors in the calculations. An FFT of length 4 does not
use twiddle factors, therefore there is no error between the VHDL and MATLAB data.
The stem plot in Figure 4.2 shows the percent error between the MATLAB
and VHDL values for each value of n. The results show relatively small error except
for n values of 6 and 7. This is due to the twiddle factor approximation. For the n=6
case, the exact twiddle factor would be $(-\cos(\pi/4) - j \sin(\pi/4))$. Expanding this out
yields a result of (-0.70710678, -0.70710678j). Because the twiddle factors in the
VHDL implementation are scaled integer values, the result is scaled from a range of
(-1, 1) up to a range of $(-(2^{tfbw-1} - 1),\ 2^{tfbw-1} - 1)$, where tfbw is the twiddle factor bit
width. For a twiddle factor bit width of 10, this range becomes (-511, 511). Scaling
the n=6 twiddle factors results in values of (-361.331565, -361.331565j). These
values are ultimately rounded to (-361, -361j). A negligible 2% error is introduced
with this rounding. Increasing the twiddle factor bit width will reduce this error, but
with an increase in chip area and power consumption. Figure 4.3 shows the plot of
the FFT data and error analysis for a length-16 FFT. The results are similar to those
of a length-8 FFT. With length-32 and length-64 FFTs, two sets of twiddle factors
are now used. Because of this, the error in the first set of twiddle factors is now
multiplied by using a second set of factors. This is evident in Figure 4.4 as the errors
for each twiddle factor bit width are now larger than in the length-8 and length-16 case.
This trend continues with Figures 4.5 - 4.9. The longer-length FFTs generate more
error in the data than the shorter ones. Smaller twiddle factor bit widths
lead to a larger average error. Increasing the bit width from 6 to 8 shows a
large decrease in average error. Increasing the bit width further to 10 yields better
results, but beyond that the decrease in error is negligible.
Figure 4.1: Average % Error


4.1.2 Frequency Sweep. For the frequency sweep, input bit-widths of 8, 10,
12, 14, and 16 along with twiddle factor bit-widths of the same values will be analyzed.
The resulting frequency sweep produces a plot as shown in Figure 4.10. The average
percent error and maximum percent error data are shown in Figures 4.11 and 4.12. The
results for the maximum percent error show that for a twiddle factor bit-width of 8 the
error remains the same no matter the input bit-width. This changes with twiddle
factor bit-widths of 12 and higher. With an increase in input bit-width, maximum
percent error drops for twiddle factor bit-widths of 10 - 16. The average percent error
decreases with an increase in twiddle factor bit-width, input bit-width, or both.

Figure 4.2: Percent error between MATLAB and VHDL FFT functions for N=8 and various twiddle factor bit-widths

Figure 4.3: Percent error between MATLAB and VHDL FFT functions for N=16 and various twiddle factor bit-widths

Figure 4.4: Percent error between MATLAB and VHDL FFT functions for N=32 and various twiddle factor bit-widths

Figure 4.5: Percent error between MATLAB and VHDL FFT functions for N=64 and various twiddle factor bit-widths

Figure 4.6: Percent error between MATLAB and VHDL FFT functions for N=128 and various twiddle factor bit-widths (panels: twiddle factor bit width = 6, 8, 10, 12; x-axis: n)

Figure 4.7: Percent error between MATLAB and VHDL FFT functions for N=256 and various twiddle factor bit-widths

Figure 4.8: Percent error between MATLAB and VHDL FFT functions for N=512 and various twiddle factor bit-widths

Figure 4.9: Percent error between MATLAB and VHDL FFT functions for N=1024 and various twiddle factor bit-widths


Figure  4.10:  FFT  Plot  of  Frequency  Sweep  for  0-2.5GHz 
Figure 4.11: Average Error in Frequency Sweep

Figure 4.12: Maximum Error in Frequency Sweep (legend bins: 0-50, 50-100, 100-150, 150-200, 200-250, 250-300)
4.2 Timing Analysis

For this section and the following sections, an analysis will be performed on the
physical effects of an FFT processor which can calculate one length (fixed) versus one
which can calculate several lengths (dynamic). The critical path is studied first.
A comparison between the non-pipelined and pipelined versions is shown, using both
a 350 nm and a 90 nm technology. Using two different technology libraries shows
how the timing scales. Table 4.1 shows the parameters of the analysis.
Figures 4.13 and 4.14 show the results of the analysis. The y-axis shows the speed
in MHz, while the x-axis is divided up by the different lengths (8 - 1024). Each
of these is subdivided by the number of allowable dynamic sizes. A trend with the
non-pipelined version shows maximum frequencies are similar for lengths using the
same stages. For example, a length-128 and a length-256 FFT use the same hardware,
therefore the speeds are similar. Another trend is that the larger the length, the lower
the maximum speed. This trend is due to large adders and multipliers, which occur
because the bit width of the data passing through the FFT is not rounded or clipped.
Additionally, if a dynamic FFT is needed, any value of nc > 1 produces the same
results. The pipelined version shows a doubling in maximum speed. For a length-8
FFT, speeds of approximately 450 MHz are obtained. On the other end of the
spectrum, a length-1024 FFT can run between 200 - 250 MHz. The variation in these
values is due to the different critical paths in the processor. The 90 nm version is
approximately 6 times faster than the 350 nm version.

Table 4.1: Parameters for Timing Analysis

Parameter                  Value
log2N                      3 - 10
nc                         1 - 8 (based on log2N)
Input Width                10
Twiddle Factor Bit Width   10
Pipelining                 Off/On
Software                   Cadence Encounter RTL Compiler
Cell Libraries             TSMC High Performance General Purpose 90 nm,
                           AMI 350 nm
Figure 4.13: Maximum frequency for pipelined and non-pipelined FFTs with input bitwidth=10 and twiddle factor bitwidth=10 using the 350 nm technology.

Figure 4.14: Maximum frequency for pipelined and non-pipelined FFTs with input bitwidth=10 and twiddle factor bitwidth=10 using the 90 nm technology.


4.3 Total Area Analysis

Table 4.2 shows the parameters for this testing. The total area needed for
each configuration is now discussed, using both the 350 nm and 90 nm libraries.
Figure 4.16 shows the results of the analysis. As expected, with each increase in
length the total area increases. Total area for lengths using the same stages is
similar, as in the case of the 32/64 lengths. The 90 nm version is approximately 44
times smaller in area than the 350 nm version.


Table 4.2: Parameters for Total Area

Parameter                  Value
log2N                      2 - 10
nc                         1 - 8 (based on log2N)
Input Bit Width            10
Twiddle Factor Bit Width   10
Pipelining                 On
Software                   Cadence Encounter RTL Compiler
Cell Libraries             TSMC High Performance General Purpose 90 nm,
                           AMI 350 nm

Figure 4.15: Total area using 350 nm technology for log2N from (a) 2 to 8 (b) 9 to 10 (x-axis: log2N - nc)

Figure 4.16: Total area using 90 nm technology for log2N from (a) 2 to 8 (b) 9 to 10


4.4 Chapter Summary

This chapter analyzed the results of the design. The error analysis shows the
error resulting from different twiddle factor bit-widths compared to that of the FFT
function found in MATLAB. Longer-length FFTs generally encounter more error than
shorter lengths do. The frequency sweep shows how the input and twiddle factor
bit-widths affect maximum and average percent error. Additionally, the timing and total
area were analyzed for all possible configurations of the FFT processor. The additional
hardware for longer-length FFTs increases the total die area. Only small
changes are noticed with an increase in the dynamic size parameter nc.
V. Conclusions


5.1 Explanation of the Problem

The problem to be solved was to accomplish the characterization and
implementation of an FFT using a fast and portable design strategy. By developing this
type of strategy a designer can create digital components for specific functions in a
matter of hours or days, as opposed to the conventional design flow which could take
weeks or months. Developing standard components which are both configurable and
dynamic, and storing them in a library, will greatly decrease the development time for
producing VLSI components for digital radar applications.

5.2  Summary  of  Background 

A review of a variety of book, article, and internet sources was performed in
order to understand, investigate, and verify the methods and previous technologies
that support this research. A brief overview of the Fourier Transform, DFT, and FFT
was given. Additionally, two algorithms for computing the FFT were examined.
One of the algorithms, the DIF, was used in the implementation. Hardware description
languages, namely VHDL and Verilog, were discussed along with some pros and cons of
each. Lastly, two previous implementations were analyzed.

5.3  HDL  Code  Development:  Significance,  Limitations,  and  Further 
Research 

This research successfully demonstrated the use of a modular mixed signal VLSI
design approach. An example component, the FFT, was developed and demonstrated
for many types of configurations. The 90 nm technology library allowed the design to be
synthesized with a smaller area and lower power consumption, in addition to faster speeds.
In addition, the design was synthesized in a 350 nm library to show the scaling
between the two technologies. The maximum speed at which this FFT processor can
run is also greatly improved. In [H] the author demonstrates a speed of 20 MHz with
a length-1024 FFT. In this research, speeds of approximately 225 MHz have been
simulated, roughly 11 times that speed. Compared to the Handel-C version by [20],
in which the author's implementation can attain a speed of 82 MHz, this paper's
design runs at about 2.74 times that speed. Figure 5.1 shows a comparison between this
implementation and two other implementations [14] [15] on FPGAs. Although maximum
frequency comparisons on similar hardware vary between +/- 10%, this implementation has
its strength in being modular and portable. The VHDL is portable and self-contained,
as the twiddle factors are generated from a source VHDL file. Additionally, because
this function will be placed into a library, it is customizable and dynamic. Before
synthesizing, a designer can modify the FFT to be used in any type of project. Also,
the dynamic properties of this FFT allow it to calculate different length FFTs during
run-time with the simple modification of one signal.

Initially, designing configurable and dynamic components is a lengthy process.
The total number of lines of VHDL code for the entire design is well over 11,000. All
possible scenarios must be accounted for and tested. For this implementation, there are 44
possible configurations for max length and dynamic lengths. Adding pipelining
options doubles this number to 88, which leads to a lengthy testing process. Once this
is complete, though, the design can be tailored to almost any specific need.
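The configuration count can be reproduced with a short enumeration. The counting rule used here is an assumption inferred from the stated ranges (maximum lengths from 4 to 1024, nc from 1 to 8, and no dynamic length shorter than 4); it is not taken from the VHDL source, and the function names are hypothetical.

```python
def max_dynamic_choices(log2_n, nc_limit=8, min_log2=2):
    """Largest nc for a given maximum length: successive halvings down to length 4, capped at 8."""
    return min(nc_limit, log2_n - min_log2 + 1)


def count_configurations(min_log2=2, max_log2=10, nc_limit=8):
    """Count (maximum length, nc) pairs over all supported maximum lengths."""
    return sum(
        max_dynamic_choices(k, nc_limit, min_log2)
        for k in range(min_log2, max_log2 + 1)
    )


total = count_configurations()
print(total, 2 * total)  # without and with the pipelining option
```

Under this rule the maximum lengths 4 through 512 contribute 1 through 8 choices respectively and length 1024 contributes another 8.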

As with any type of computing device, there are several areas of research which
can be expanded on to improve the FFT implementation. One key area to explore
is expanding the 1024 length limit. With FFTs, the longer the length, the more
accurate the signal representations in the frequency domain. Also, a combination of
the DIT and DIF algorithms will briefly be discussed, as this will decrease the number
of calculations needed.

5.3.1 Expanding Beyond the 1024-point Limit. Due to the modularity of
the design, extending the maximum N value past 1024 is not difficult. The
components which would need modification are listed below:

ROMGenerate.vhd The twiddle factors W0 to W3 were referenced with W0 being
the twiddle factor for the highest-numbered stage (i.e. stage 5) in the design and
W3 being the factors for the second stage (the first stage does not use twiddle
factors). This ordering of the variables follows the way they are presented
in the DIT and DIF algorithms. Consequently, to extend past 1024 the
twiddle factor variables must be shifted. For example, if one wanted to use a
max length of 4096, the twiddle factors would range from W0 to W4 with the
new stage 6 using W0. The only changes needed in the ROMGenerate.vhd code
would be to the main function. Here, one would change the Mp_array variable,
the p for loop, and the Mp calculation.

[Figure 5.1: Previous implementation comparisons. Panels (a) and (b); horizontal axis: FFT Length.]

FFT.vhd To add 2048 as a possible length, a new generate statement will be needed
(i.e. Neq2048). The generate statements for the existing sizes show the pattern
to follow. All the possible nc choices will also have to be covered, so that only
the minimal components necessary are instantiated for each value of nc. It is
helpful to create a drawing similar to the completed design layout shown in
Chapter II to implement the necessary muxes and sign extenders. Karnaugh maps
are very useful here.

Components.vhd To follow along with the design by He and Torkelson [10], the
twiddle factors are numbered from 0 to 3 going from left to right in the design.
To accommodate a larger length, these values must be 'shifted' to the
left. Renumbering the twiddle factors in this module and adding a new W4
component will work.

All other components are modular and do not need any modifications.
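To make the ROM contents concrete, the sketch below builds a quantized twiddle-factor table for an arbitrary length. It is not the ROMGenerate code: it assumes the standard definition W_N^k = exp(-j 2 pi k / N) and borrows the scale factor 2^(tfbw - 1) - 1 that the error-analysis script in Appendix B uses; the function name is hypothetical.

```python
import cmath


def quantized_twiddles(n, tf_width):
    """Return W_n^k = exp(-j*2*pi*k/n), k = 0..n-1, as scaled integer pairs."""
    scale = 2 ** (tf_width - 1) - 1  # largest positive signed value
    table = []
    for k in range(n):
        w = cmath.exp(-2j * cmath.pi * k / n)
        table.append((round(w.real * scale), round(w.imag * scale)))
    return table


table = quantized_twiddles(4096, 10)
print(table[0], table[1024], table[2048])  # (511, 0) (0, -511) (-511, 0)
```

Moving from a 1024-point to a 4096-point maximum only changes the argument n here; in the actual design the stage-to-ROM mapping described above would also have to be renumbered.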

5.3.2 Implementation of a Decimation-In-Time-Frequency Algorithm. Ali
Saidi developed an algorithm which he claims reduces the number of real
multiplications and additions [17]. The algorithm is called the
“Decimation-In-Time-Frequency (DITF) FFT Algorithm.” It reduces the arithmetic
complexity while using the same structure as the conventional Cooley-Tukey FFT
algorithm. He extended the algorithm to the radix-2 FFT, which is the type
implemented in this research. The author explains that the heart of the DITF
algorithm is this observation: in the DIF algorithm most of the calculations
are performed in the early stages, while in the DIT algorithm most of the
calculations are done in the final stages [17]. The author proposes starting
with the DIT FFT algorithm and then switching to the DIF FFT algorithm at some
intermediate stage, which decreases the amount of computation needed. The flow
graph in Figure 5.2 illustrates a 32-point DITF FFT algorithm. The cost of the
transition from DIT to DIF and the savings due to this transition vary
depending on the stage at which the algorithms switch. The author analyzes this
trade-off in [17].
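The observation about where the work falls can be checked with a small counting sketch. It is not taken from [17]; it assumes a common cost model in which twiddles equal to 1 or -j are free, twiddles at odd multiples of 45 degrees cost 2 real multiplies, and all others cost 4. Under that model the totals agree with the Radix-2 CT column of Table 5.1.

```python
def dit_real_multiplies_per_stage(n):
    """Real multiplies in each stage of a radix-2 DIT FFT of length n.

    Assumed cost model: W = 1 or -j costs 0, odd multiples of 45 degrees
    cost 2 real multiplies, every other twiddle costs 4.
    """
    stages = n.bit_length() - 1              # log2(n) for a power of two
    counts = []
    for s in range(1, stages + 1):
        span = 2 ** s                        # butterfly group size at stage s
        per_group = 0
        for k in range(span // 2):           # twiddle W_span^k
            if k == 0 or 4 * k == span:
                cost = 0                     # trivial: 1 or -j
            elif 8 * k == span or 8 * k == 3 * span:
                cost = 2                     # 45-degree twiddle
            else:
                cost = 4                     # general complex multiply
            per_group += cost
        counts.append(per_group * (n // span))   # n/span groups per stage
    return counts


dit = dit_real_multiplies_per_stage(32)
dif = list(reversed(dit))  # DIF uses the same twiddle sets in reverse stage order
print(dit, dif, sum(dit))  # [0, 0, 16, 40, 52] [52, 40, 16, 0, 0] 108
```

The DIT counts grow toward the final stage and the DIF counts shrink from the first, which is the asymmetry the DITF algorithm is described as exploiting.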

Table 5.1 shows the number of real multiplies for several lengths (N) for both the
Radix-2 Cooley-Tukey and the DITF algorithm, along with several other algorithms.
The data show that the radix-2 DITF algorithm needs fewer multiplications than
radix-2 Cooley-Tukey for every length above N = 8, and the absolute saving grows
with length (for example, 10744 versus 13324 real multiplies at N = 1024).
Decreasing the number of operations needed to compute the FFT should make the
overall calculation faster.
[Figure 5.2: 32-Point DITF FFT Flow Graph]
Table 5.1: Number of real multiplies for complex FFT algorithms. [17]

   Size      Split  Radix-2  Radix-2  Radix-4  Radix-4  Radix-8  Radix-8
 M      N    Radix       CT     DITF       CT     DITF       CT     DITF
 3      8        4        4        4      N/A      N/A      N/A      N/A
 4     16       24       28       24       24       24      N/A      N/A
 5     32       84      108       88      N/A      N/A      N/A      N/A
 6     64      248      332      248      264      264      248      248
 7    128      660      908      696      N/A      N/A      N/A      N/A
 8    256     1656     2316     1784     1800     1656      N/A      N/A
 9    512     3988     5644     4472      N/A      N/A     3992     3992
10   1024     9336    13324    10744    10248     9528      N/A      N/A
11   2048    21396    30732    25336      N/A      N/A      N/A      N/A
12   4096    48248    69644    58360    53256    49656    48280    47608
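The radix-2 columns of Table 5.1 can be compared directly. The short sketch below copies those two columns from the table and reports the absolute and percentage reduction for each length; the names are illustrative only.

```python
# Real-multiply counts copied from Table 5.1: N -> (radix-2 CT, radix-2 DITF)
RADIX2_MULTIPLIES = {
    8: (4, 4), 16: (28, 24), 32: (108, 88), 64: (332, 248),
    128: (908, 696), 256: (2316, 1784), 512: (5644, 4472),
    1024: (13324, 10744), 2048: (30732, 25336), 4096: (69644, 58360),
}


def ditf_saving(n):
    """Return (absolute, percent) reduction in real multiplies for length n."""
    ct, ditf = RADIX2_MULTIPLIES[n]
    saved = ct - ditf
    return saved, 100.0 * saved / ct


for n in sorted(RADIX2_MULTIPLIES):
    saved, percent = ditf_saving(n)
    print(f"N={n:5d}  saved={saved:6d}  ({percent:.1f}%)")
```

For the 1024-point case this gives 2580 fewer real multiplies, roughly 19.4 percent of the Cooley-Tukey count.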
Appendix A. FFT Synthesis Testing Scripts


Listing A.1: Perl Synthesis Script

#!/usr/bin/perl -w

# Make sure the files we need exist ...
if (! -e "FFT.vhd") {
    print "Missing FFT.vhd\n";
    exit();
}
if (! -e "FFTComponents.vhd") {
    print "Missing FFTComponents.vhd\n";
    exit();
}
if (! -e "ROMGenerate.class") {
    print "Missing ROMGenerate.class\n";
    exit();
}
if (! -e "FFT.cmd") {
    print "Missing FFT.cmd\n";
    exit();
}

# For this set of tests, fix the input bitwidth and
# the twiddle factor bitwidth to 10 bits each.
$in_width = 10;
$tf_width = 10;

# Loop through all possible lengths ...
for ($log2N = 2; $log2N <= 10; $log2N++) {

    # determine the max nc value.
    $maxNC = $log2N - 1;
    if ($log2N == 10) {
        $maxNC = 8;
    }

    # loop through all possible nc values
    for ($nc = 1; $nc <= $maxNC; $nc++) {
        print "Synthesizing $log2N $nc $tf_width $in_width\n";

        # calculate the output width, which is based on
        # log2N, input_width, and tf_width
        $temp = $log2N + ($log2N % 2);
        $out_width = $in_width + $temp * 1 + ($temp/2 - 1) * $tf_width;

        # Copy FFT.template.vhd to FFT.vhd
        `cp FFT.template.vhd FFT.vhd`;

        # Modify the parameters of the FFT.vhd file ...
        print "  Modifying FFT.vhd...\n";
        `perl -pi -e 's/#in_width/$in_width/g' FFT.vhd`;
        `perl -pi -e 's/#out_width/$out_width/g' FFT.vhd`;
        `perl -pi -e 's/#tf_width/$tf_width/g' FFT.vhd`;
        `perl -pi -e 's/#nc/$nc/g' FFT.vhd`;
        `perl -pi -e 's/#logbase2N/$log2N/g' FFT.vhd`;
        `perl -pi -e 's/#pipeline/YES/g' FFT.vhd`;

        # Generate the twiddle factors
        print "  Generating TwiddleFactors.vhd...\n";
        `~/bin/java ROMGenerate $log2N $nc $tf_width vhdl`;

        # If log2N = 2 (length 4) create a blank TwiddleFactors.vhd
        # so Cadence won't choke
        if ($log2N == 2) {
            `touch TwiddleFactors.vhd`;
        }

        # execute RTL script ...
        print "  Running synthesis script...\n";
        `rc -files FFT.cmd`;

        # copy output files to specific file
        print "  Copying results to synth.results directory...\n";
        $timing_filename = join '', 'timing.', $log2N, '_', $nc, '_PIPE.txt';
        $area_filename = join '', 'area.', $log2N, '_', $nc, '_PIPE.txt';
        $power_filename = join '', 'power.', $log2N, '_', $nc, '_PIPE.txt';
        `mv timing.txt synth.results/$timing_filename`;
        `mv area.txt synth.results/$area_filename`;
        `mv power.txt synth.results/$power_filename`;
    }
}
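The output-width expression in Listing A.1 is easy to misread, so a worked example may help. The sketch below restates the same formula in Python; the interpretation in the comments (one bit of growth per stage and tf_width bits per twiddle multiplication) is an inference from the formula, not something the listing states.

```python
def output_width(log2n, in_width, tf_width):
    """Output bit width using the formula from Listing A.1."""
    temp = log2n + (log2n % 2)  # round log2N up to an even number
    # temp bits of growth, plus tf_width bits for each of the
    # (temp/2 - 1) twiddle multiplications (interpretation assumed).
    return in_width + temp + (temp // 2 - 1) * tf_width


for log2n in (2, 5, 10):
    print(log2n, output_width(log2n, 10, 10))  # 2 12, then 5 36, then 10 60
```

With 10-bit inputs and 10-bit twiddle factors, the 1024-point configuration therefore produces 60-bit outputs.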


Listing A.2: Cadence Synthesis Script

# Cadence RTL Compiler (RC)
#   version 05.20-p002 (32-bit) built Nov 28 2005
#
# Run with the following arguments:
#   -logfile rc.log
#   -cmdfile rc.cmd

# setup the library search path to the 90nm libraries from TSMC
set_attribute lib_search_path /home/afiten3/gce07m/bbrakus/libraries/TSMCHOME/digital/Front_End/...
    timing_power/tcbn90ghp_150a

# setup the hdl search paths to the current directory
set_attribute hdl_search_path .

# load one of the 90nm libraries
set_attribute library tcbn90ghpbc.lib

# read all the vhdl files
read_hdl -vhdl TwiddleFactors.vhd FFTComponents.vhd FFT.vhd

# compile and check for errors
elaborate FFT

# synthesize the design
synthesize -to_mapped FFT

# create reports and save them to the current directory
report timing > timing.txt
report area > area.txt
report power > power.txt

quit
Appendix B. FFT Error Analysis Testing Scripts


Listing B.1: MATLAB Error Analysis Script

function [] = TestFFTerror(log2length, input_bitwidth)
%TestFFTerror Generates test input and FFT results data
%   TestFFTerror(log2length, input_bitwidth)
%   log2length     = log base 2 of FFT length
%   input_bitwidth = bitwidth of input data

length = 2^log2length;
in_scale = 2^(input_bitwidth - 1) - 1;
n = [0:29];
data = cos(2*pi*n/10);
data = round(data * in_scale);

% open file to store input data
in_id = fopen('input.data.txt', 'wt');
if (in_id == -1)
    error('cannot open file for writing');
end

% store input data in file ...
for j = 1:30
    fprintf(in_id, '%d\n', real(data(j)));
    fprintf(in_id, '%d\n', imag(data(j)));
end
for j = 31:length
    fprintf(in_id, '0\n');
    fprintf(in_id, '0\n');
end
fclose(in_id);

for tfbw = 6:2:12
    % scale factor
    tf_scale = 2^(tfbw - 1) - 1;

    % calculate FFT of data...
    matlab = (fft(data, length));

    % scale data based on length ...
    if (length == 8 || length == 16)
        matlab = matlab * tf_scale^1;
    elseif (length == 32 || length == 64)
        matlab = matlab * tf_scale^2;
    elseif (length == 128 || length == 256)
        matlab = matlab * tf_scale^3;
    elseif (length == 512 || length == 1024)
        matlab = matlab * tf_scale^4;
    end

    % bit reverse data ...
    rev = zeros(1, length);
    for j = 0:length-1
        binstr = dec2bin(j, log2(length));
        binstr = fliplr(binstr);
        bitrev = bin2dec(binstr);
        rev(bitrev+1) = matlab(j+1);
    end

    % open file to store matlab FFT data
    % (name must match the one read by CompareDataStem)
    fftfilename = strcat('fft_matlab_', num2str(length), '_', num2str(tfbw), '.txt');
    fft_id = fopen(fftfilename, 'wt');
    if (fft_id == -1)
        error('cannot open file for writing');
    end

    % store FFT data in file...
    for j = 1:length
        fprintf(fft_id, '%d %d\n', round(real(rev(j))), round(imag(rev(j))));
    end
    fclose(fft_id);

    disp('Run the Modelsim simulator to generate VHDL data using the following parameters:');
    dispstr = strcat('input_width = ', num2str(input_bitwidth));
    disp(dispstr);
    temp = log2length + mod(log2length, 2);
    output_width = input_bitwidth + temp*1 + (temp/2 - 1)*tfbw;
    dispstr = strcat('output_width = ', num2str(output_width));
    disp(dispstr);
    dispstr = strcat('tfbw = ', num2str(tfbw));
    disp(dispstr);
    dispstr = strcat('log2N = ', num2str(log2length));
    disp(dispstr);
    disp(' ');
    disp('Press any key when done...');
    pause;

    % rename the generated VHDL FFT data
    vhdlfilename = strcat('fft_vhdl_', num2str(length), '_', num2str(tfbw), '.txt');
    movefile('fft_vhdl.txt', vhdlfilename);
end

CompareDataStem(log2length);
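Listing B.1 reorders the MATLAB FFT output into bit-reversed order before writing it out. The same reordering can be sketched in Python as follows; the helper names are hypothetical and the input length is assumed to be a power of two.

```python
def bit_reverse(index, num_bits):
    """Reverse the lowest num_bits bits of index."""
    bits = format(index, f"0{num_bits}b")  # zero-padded binary string
    return int(bits[::-1], 2)


def to_bit_reversed_order(values):
    """Place values[j] at position bit_reverse(j), as Listing B.1 does."""
    num_bits = (len(values) - 1).bit_length()  # log2 of a power-of-two length
    out = [None] * len(values)
    for j, v in enumerate(values):
        out[bit_reverse(j, num_bits)] = v
    return out


print(to_bit_reversed_order(list(range(8))))  # [0, 4, 2, 6, 1, 5, 3, 7]
```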
Listing B.2: MATLAB Stem Plot Script

function [] = CompareDataStem(log2N)
%CompareDataStem Compares error results and produces a stem plot
%   Detailed explanation goes here
n = 2^log2N;

for tfbw = 6:2:12
    matlabFilename = strcat('fft_matlab_', int2str(n), '_', int2str(tfbw), '.txt');
    vhdlFilename = strcat('fft_vhdl_', int2str(n), '_', int2str(tfbw), '.txt');
    % the filename prefix of the difference file is illegible in the source listing
    diffFilename = strcat('...', int2str(n), '_', int2str(tfbw), '.txt');
    matlabID = fopen(matlabFilename, 'r');
    vhdlID = fopen(vhdlFilename, 'r');
    diffID = fopen(diffFilename, 'wt');
    if (diffID == -1)
        error('cannot open file for writing');
    end

    % read the real/imaginary pairs, one pair per line
    matlabTHM = fscanf(matlabID, '%f %f', [2 inf]);
    vhdlTHM = fscanf(vhdlID, '%f %f', [2 inf]);
    matlabTHM = matlabTHM';
    vhdlTHM = vhdlTHM';
    matlabRE = (matlabTHM(:,1));
    matlabIM = (matlabTHM(:,2));
    vhdlRE = (vhdlTHM(:,1));
    vhdlIM = (vhdlTHM(:,2));

    % percent error per sample; samples with a zero component are skipped
    n = length(matlabRE);
    reDiff = zeros(1, n);
    imDiff = zeros(1, n);
    for i = 1:n
        if (matlabRE(i) ~= 0 && matlabIM(i) ~= 0)
            reDiff(i) = (100*(vhdlRE(i) - matlabRE(i))/matlabRE(i));
            imDiff(i) = (100*(vhdlIM(i) - matlabIM(i))/matlabIM(i));
        else
            reDiff(i) = 0;
            imDiff(i) = 0;
        end
        fprintf(diffID, '%f %f\n', reDiff(i), imDiff(i));
    end

    reMaxError = max(reDiff);
    reMinError = min(reDiff);
    imMaxError = max(imDiff);
    imMinError = min(imDiff);

    % the same plot commands draw each of the four subplots
    titlestring = strcat('Twiddle Factor Bit Width = ', int2str(tfbw));
    subplot(2, 2, (tfbw-4)/2);
    stem([1:n], reDiff, 'bx'), xlim([0 n+1]);
    hold on
    stem([1:n], imDiff, 'r+'), title(titlestring), xlabel('n'), ylabel('% Error'), xlim([0 n+1]);

    fclose(matlabID);
    fclose(vhdlID);
    fclose(diffID);
end

% open file to store error data
errorfilename = strcat('images\N', num2str(n), 'TF', num2str(tfbw), 'error.txt');
error_id = fopen(errorfilename, 'wt');
if (error_id == -1)
    error('cannot open file for writing');
end

reMaxError = max(abs(reDiff));
reAveError = mean(reDiff);
reStdDev = std(reDiff);
imMaxError = max(abs(imDiff));
imAveError = mean(imDiff);
imStdDev = std(imDiff);

fprintf(error_id, 'Max real error = %f\n', reMaxError);
fprintf(error_id, 'Average real error = %f\n', reAveError);
fprintf(error_id, 'Standard deviation of real = %f\n', reStdDev);
fprintf(error_id, 'Max imag error = %f\n', imMaxError);
fprintf(error_id, 'Average imag error = %f\n', imAveError);
fprintf(error_id, 'Standard deviation of imag = %f\n', imStdDev);
fclose(error_id);
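The percent-error calculation at the core of Listing B.2 can be restated compactly. The Python sketch below mirrors its guard against zero-valued reference components; the sample values are made up for illustration and are not results from this work.

```python
def percent_error(measured, reference):
    """Percent error per sample; 0 where a reference component is zero."""
    re_diff, im_diff = [], []
    for m, r in zip(measured, reference):
        if r.real != 0 and r.imag != 0:
            re_diff.append(100 * (m.real - r.real) / r.real)
            im_diff.append(100 * (m.imag - r.imag) / r.imag)
        else:
            # Same guard as Listing B.2: skip samples with a zero component.
            re_diff.append(0.0)
            im_diff.append(0.0)
    return re_diff, im_diff


reference = [complex(200, -100), complex(0, 50)]  # made-up sample values
measured = [complex(198, -101), complex(1, 50)]
print(percent_error(measured, reference))  # ([-1.0, 0.0], [1.0, 0.0])
```

Note that the guard also zeroes the error for samples where only one component of the reference is zero, so any error in the other component of such samples is not reported.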